Outline


•  Embedded  computing  demands  high  arithmetic  rates 
with  low  power 

•  VLSI  technology  can  deliver  this  capability  -  but 
microprocessors  cannot 

•  Stream  processors  realize  the  performance/power 
potential  of  VLSI  while  retaining  flexibility 
Embedded systems demand high arithmetic rates with low power

For N=10, BW=100MHz, S=16, B=4, about 500GOPs

VLSI provides high arithmetic rates with low power - microprocessors do not


PowerPC G4: 95mm², 1nJ/op

32b adder + RF: 512 x 163 tracks, 205µm x 65µm, 0.013mm², ~5pJ/op

Area 7300:1, Energy 200:1, Ops 4:1
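
The ratios on this slide follow from the two data points by simple division. A minimal sketch in Python, reading the figures as 95 mm² and 1 nJ/op for the PowerPC G4 and 0.013 mm² and roughly 5 pJ/op for the 32-bit adder with register file. The function and constant names are illustrative only, not from the slides:

```python
def ratio(processor_value, alu_value):
    """Return how many times larger the processor figure is than the ALU figure."""
    return processor_value / alu_value


# Figures read from the slide; the ALU energy is approximate.
G4_AREA_MM2 = 95.0
G4_ENERGY_PJ = 1000.0   # 1 nJ/op expressed in pJ
ALU_AREA_MM2 = 0.013
ALU_ENERGY_PJ = 5.0

area_ratio = ratio(G4_AREA_MM2, ALU_AREA_MM2)
energy_ratio = ratio(G4_ENERGY_PJ, ALU_ENERGY_PJ)

print(f"area {area_ratio:.0f}:1, energy {energy_ratio:.0f}:1")
```

The area ratio comes out slightly above 7300, which the slide rounds to 7300:1.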


stream  Proc:  4 


Sept  25,  2002 


Why do Special-Purpose Processors Perform Well?


Lots  (100s)  of  ALUs 


Fed  by  dedicated  wires/memories 

Care and Feeding of ALUs

'Feeding' Structure Dwarfs ALU

Stream Programs Expose Locality and Concurrency


Kernels exploit both instruction (ILP) and data (SIMD) level parallelism.

Kernels can be partitioned across chips to exploit task parallelism.

[...] producer-consumer locality.

[...] parallelism without the complexity of traditional parallel programming.

A Bandwidth Hierarchy exploits locality and concurrency

•  VLIW clusters with shared control

•  41.2  32-bit  floating-point  operations  per  word  of  memory  BW 

Producer-Consumer Locality in the Depth Extractor

A Bandwidth Hierarchy exploits kernel and producer-consumer locality


Memory: 2GB/s, stream register file (SRF): 32GB/s, local registers: 544GB/s
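
To make the steps of the hierarchy concrete, the sketch below normalizes the three bandwidth figures to the smallest one. It is a plain Python illustration; the level names attached to 2, 32, and 544 GB/s are an assumption based on the rest of the talk (544 GB/s is described later as local and 32 GB/s as on-chip), and the helper name is hypothetical:

```python
# Bandwidth figures from the slide; level names are assumed, see the lead-in.
HIERARCHY_GBPS = [
    ("off-chip memory", 2.0),
    ("stream register file", 32.0),
    ("local registers", 544.0),
]


def relative_bandwidth(levels):
    """Express each level's bandwidth as a multiple of the first level's."""
    base = levels[0][1]
    return {name: gbps / base for name, gbps in levels}


rel = relative_bandwidth(HIERARCHY_GBPS)
for name, multiple in rel.items():
    print(f"{name}: {multiple:.0f}x")
```

Each step up the hierarchy multiplies the available bandwidth by roughly an order of magnitude (16x, then 17x).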

Bandwidth Demand of Applications

[Figure: bandwidth (GB/s) demanded by each application]

Local registers increase effective size and bandwidth of SRF

•  ~90%  of  live  variables  are  captured  in  local  registers 

•  Only  10%  of  live  variables  need  be  stored  in  stream 
register  file 

•  Fixed-size SRF is effectively 10x the size of a VRF that must hold all live variables

•  Bandwidth into FPUs is 10x the SRF bandwidth
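
The 10x figure follows from the share of live variables that never reach the SRF. A minimal Python sketch of that arithmetic, taking the ~90% share as exactly 90 for illustration (the function name is hypothetical):

```python
def srf_multiplier(percent_in_local_registers):
    """Effective size/bandwidth gain of the SRF when only the remaining
    share of live variables has to be stored in it."""
    percent_in_srf = 100 - percent_in_local_registers
    return 100 / percent_in_srf


gain = srf_multiplier(90)
print(f"effective SRF gain: {gain:.0f}x")
```

The same function shows how sensitive the gain is to the capture rate: at 80% captured locally the gain would drop to 5x.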

Cluster Occupancy > 80%

Performance demonstrated on signal and image processing

[Figure: GOPS achieved per application]

Prototype


•  Prototype  of  Imagine  architecture 

-  Proof-of-concept 2.56cm² die in 0.15µm TI process, 21M transistors

-  Collaboration  with  TI  ASIC 

-  Runs  all  benchmarks  at  240MHz 

•  Dual-Imagine  development  board 

-  Platform  for  rapid  application 
development 

-  Test & debug building blocks of a 64-node system

-  Collaboration  with  ISI-East 

Imagine is programmed in "C" at two levels


•  Streams:

-  Sequences of records

•  Kernels:

-  Functions that operate on streams

-  Written in KernelC

-  Compiled by kernel scheduler

•  Stream program:

-  Defines streams and the control- and data-flow between kernels

-  Written in StreamC and C++

-  Compiled by stream compiler


Simple  example 


•  StreamC:

void main() {
    Stream<int> a(256);
    Stream<int> b(256);
    Stream<int> c(256);
    Stream<int> d(1024);

    example1(a, b, c);
    example2(c, d);
}

•  KernelC:

KERNEL example1(
    istream<int> a,
    istream<int> b,
    ostream<int> c)
{
    loop_stream(a) {
        int ai, bi, ci;
        a >> ai;    // read one record from each input stream
        b >> bi;
        ci = ai * 2 + bi * 3;
        c << ci;    // append the result to the output stream
    }
}
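
For readers without the Imagine toolchain, the behavior of the example1 kernel can be emulated in plain Python. This is only a sketch of the kernel's per-record arithmetic, assuming both input streams have the same length; it does not model clusters, the SRF, or the kernel scheduler, and the input values are made up for illustration:

```python
def example1(a, b):
    """Emulate the KernelC loop: consume one record from each input
    stream per iteration and produce one output record."""
    if len(a) != len(b):
        raise ValueError("input streams must have the same length")
    return [ai * 2 + bi * 3 for ai, bi in zip(a, b)]


# Two 256-record input streams, matching the sizes declared in the StreamC code.
a = list(range(256))
b = [1] * 256
c = example1(a, b)

print(len(c), c[:4])
```

Each output record depends only on the matching input records, which is what lets the hardware apply the same kernel to many stream elements concurrently.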

Communication scheduling achieves near optimum kernel performance

7x7 convolution kernel from depth extraction application

(Above)  Single  iteration  schedule 
(Right)  Software  pipelining  shown 

Stream scheduling reduces bandwidth demand by up to 12:1 compared to caching


Stream  program 


OpenGL graphics pipeline


Current DSP programmers attempt to stage data in this manner by hand


SRF  allocation 

We have developed...


•  A stream architecture that exploits locality and concurrency

-  Keeps  99%  of  the  data  accesses  on  chip 

-  Aligned  accesses  to  SRF 

-  Enables  efficient  use  of  large  numbers  (100s)  of  ALUs 

•  Imagine:  a  prototype  stream  processor  demonstrates  the 
efficiency  of  stream  architecture 

-  Working  in  the  lab  at  240MHz 

-  9.6GFLOPS,  19.2GOPS,  6W 

-  Programmed  in  "C" 

-  Sustains  ~5GOPS/W  at  1.2V  (200pJ/OP) 

•  and  demonstrated  image-processing,  signal  processing,  and 
graphics  applications  on  the  Imagine  stream  processor 
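
The throughput and efficiency figures above can be cross-checked with a few unit conversions. The sketch below is a Python illustration using only the numbers on this slide; the function names are hypothetical, and it makes no claim about the operating conditions under which each figure was obtained:

```python
def ops_per_cycle(giga_ops_per_second, clock_mhz):
    """Operations completed per clock cycle at the given rate and clock."""
    return (giga_ops_per_second * 1e9) / (clock_mhz * 1e6)


def gops_per_watt(giga_ops_per_second, watts):
    """Efficiency from a throughput figure and a power figure."""
    return giga_ops_per_second / watts


def gops_per_watt_from_energy(picojoules_per_op):
    """Efficiency implied by an energy-per-operation figure:
    1 op per p pJ equals 1000/p GOPS per watt."""
    return 1000.0 / picojoules_per_op


flops_cycle = ops_per_cycle(9.6, 240)
ops_cycle = ops_per_cycle(19.2, 240)
at_6w = gops_per_watt(19.2, 6.0)
at_200pj = gops_per_watt_from_energy(200.0)

print(f"{flops_cycle:.0f} FLOPs/cycle, {ops_cycle:.0f} ops/cycle")
print(f"{at_6w:.1f} GOPS/W at 6W, {at_200pj:.1f} GOPS/W at 200pJ/op")
```

The 200pJ/op figure corresponds to 5 GOPS/W, matching the ~5GOPS/W quoted at 1.2V, while 19.2GOPS at 6W works out to about 3.2 GOPS/W.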



Stream processing can be applied to scientific computing


•  Extensions  to  architecture 

-  64b floating point: 100GFLOPS/chip

-  Support 2-D, 3-D, and irregular data structures

•  Stream cache

•  Indexable  SRF 


•  Estimates  suggest  we  can  achieve 

-  <$20/GFLOPS 

-  <$10/M-GUPS 

Conclusion


•  Streams  expose  locality  and  concurrency 

-  Concurrency  across  stream  elements 

-  Producer/consumer  locality 

-  Enables  compiler  optimization  at  a  larger  scale  than  scalar  processing 

•  A  stream  architecture  exploits  this  to  achieve  high  arithmetic 
intensity  (arithmetic  rate/BW) 

-  Keeps most (>90%) of data operations local (544GB/s, 10pJ) with low overhead

-  Keeps almost all (>99%) of data operations on chip (32GB/s, 100pJ)

•  The  Imagine  processor  demonstrates  the  advantages  of 
streaming  for  image  and  signal  processing 

-  9.6GFLOPS,  19.2GOPS,  6W  -  measured 

•  Stream  processing  is  applicable  to  a  wide  range  of  applications 

-  Scientific  computing 

-  Packet  processing 